Assembly, individual missingness and filtering

The assembly was created using only the diploid samples, and with the Yukon population removed (given its apparent distance from the rest of the sexual populations). SNPs were then called on the diploid reference using Freebayes, which is ploidy aware. This generated >400k SNPs, which were whittled down to ~16k SNPs using the following filtering steps:

# big filter. This filters based on quality, minor allele count, minor allele frequency, average depth (low and high), excludes indels and SNPs with more than 2 alleles, and removes SNPs that have >5% missing data. 
~/bcftools/bcftools filter TotalRawSNPs.vcf | bcftools filter -e "QUAL < 30 & MQM < 30 & MQMR < 30 || AVG(GQ) < 20 || F_MISSING > 0.05 || MAC < 3 || MAF <0.05 || AVG(FMT/DP)<20 || AVG(FMT/DP)>132.5" | bcftools view --max-alleles 2 --exclude-types indels --output-type v --output-file filter1.vcf

# allele balance
vcffilter -s -f "AB > 0.125 & AB < 0.875 | AB < 0.01" filter1.vcf > filter2.vcf

# SNPs found on both strands
vcffilter -f "SAF / SAR > 100 & SRF / SRR > 100 | SAR / SAF > 100 & SRR / SRF > 100" -s filter2.vcf > filter3.vcf

# ratio of mapping qualities between reference and alternate alleles
vcffilter -f "MQM / MQMR > 0.9 & MQM / MQMR < 1.05" filter3.vcf > filter4.vcf

# discrepancy in properly paired status of reads supporting reference or alternate alleles
vcffilter -f "PAIRED > 0.05 & PAIREDR > 0.05 & PAIREDR / PAIRED < 1.75 & PAIREDR / PAIRED > 0.25 | PAIRED < 0.05 & PAIREDR < 0.05" -s filter4.vcf > filter5.vcf

# ratio of locus quality score to depth
vcffilter -f "QUAL / DP > 0.25" filter5.vcf > final.filtered.snps.vcf

Clone identification

Using poppr, we were able to identify the same number of MLGs as the IBS (from SNPRelate), but automatically and with the full ploidy information. One thing to note is that it grouped the sexual Yukon pop into the same MLG, which makes sense given that they had higher IBS values. The determination of MLGs in the apo’s should not be biased by the distance metric used (I’m using the basic R dist() function with “euclidean” method) as the within-population pairwise comparisons are always between individuals of the same ploidy.



I also calculated pairwise distances within populations and plotted them. This is similar to the IBS plot I made for my AGA poster, but using euclidian distance (raw dissimilarity) instead of IBS (% similarity). The R package I was using to calculate IBS (SNPRelate) was diploidizing my data without telling me, so I’ve switched to adegenet/poppr which can handle multiple ploidy levels. However, it’s also worth noting that Meirmans et al (2018) report that this exact type of comparison (diploid-diploid vs polyploid-polyploid) is biased, so we should take these results with a grain of salt…

Ploidy confirmation

This is a cool way to confirm ploidy using genomic data found here: https://knausb.github.io/vcfR_documentation/determining_ploidy_1.html. Basically, you can harness the ‘DP’ (sequencing depth) field from your VCF files to calculate the ratio of times each allele was sequenced in each individual. The idea is that if you sequence a diploid heterozygote (A/T) at 30x coverage, we would expect to sequence each allele around 15 times (giving an allele balance of 1/2). For a triploid (A/T/T), we’d expect to sequence the ‘A’ 10 times and the ‘T’ 20 times, giving allele balances of 1/3 and 2/3. If calculate the mean allele balances for each individual and plot them, you’d thus expect to see a single peak at 1/2 for diploids, two peaks (1/3 and 2/3) for triploids, and 3 peaks (1/4, 1/2, and 3/4) for tetraploids. Here’s what it looks like for our data:

DAPC

Here’s a bunch of DAPC’s using different groups / colors.

Using MS as the two groups (Density plot of PC1)

Using MS as the two groups (Density plot of PC1)



Pops as groups, but colored by mating system

Pops as groups, but colored by mating system



Pops as groups - all pops

Pops as groups - all pops



Pops as groups - sub C59-S SM-A C23-A S03-A

Pops as groups - sub C59-S SM-A C23-A S03-A



Here’s a plot of the posterior membership probabilities:

K-means clustering

This is hot off the press. I used a K-means approach from the adegenet package (following this tutorial: http://grunwaldlab.github.io/Population_Genetics_in_R/clustering_plot.html), which uses a clustering algorithm that is similar to STRUCTURE. Plot A below is based on 50 runs of the find.clusters algorithm, and shows that ~10-13 clusters have the lowest BIC. Plot B is a scatterplot of the discriminant functions (K=10), which helps us see how different the resulting clusters are. Plot C shows barplots of the posterior probability group assignments for K = 10-13, which helps visualize how the groups are assigned under different values of K. The K-means results are more-or-less consistent with our other findings, as they assign clones to pretty much the same groups. Keep in mind that the groups/colors are assigned arbitrarily, so the colors aren’t consistent between the the different K plots at the moment (working on fixing this…).

A new paper from Patrick Meirmans’ group suggests that STRUCTURE is better than K-means and other clustering approaches for mixed-ploidy populations. However, this is only really a problem (for K-means) if you have low numbers of markers and missing dosage information…fortunately for us we have full dosage and tons of markers. My understanding is that DAPC is still better for us because it uses model-free methods, while STRUCTURE assumes panmixia and that markers are not linked.

MSN

This is a minimum spanning network. It can apparently be a more powerful visualization tool for clonal organisms than trees. Each pie represents a MLG, and the size of the pie represents the number of individuals assigned to that MLG. Blue pies are sexual, red are apo.



Poppr function - genotype richness/diversity/evenness stats

This shows us several summary statistics:

Abbreviation Statistic
Pop Population name.
N Number of individuals observed.
MLG Number of multilocus genotypes (MLG) observed.
eMLG The number of expected MLG at the smallest sample size ≥ 10 based on rarefaction
SE Standard error based on eMLG.
H Shannon-Wiener Index of MLG diversity
G Stoddart and Taylor’s Index of MLG diversity
lambda Simpson’s Index
E.5 Evenness, \(E_5\)
Hexp Nei’s unbiased gene diversity
Ia The index of association, \(I_A\)
rbarD The standardized index of association

Here are the results from the poppr function on the populations.

##      Pop   N MLG     H    G lambda   E.5       Ia  p.Ia     rbarD  p.rD
## 1  B42-S   3   3 1.099 3.00  0.667 1.000    9.401 0.001  1.68e-03 0.001
## 2  B46-S   3   3 1.099 3.00  0.667 1.000  154.674 0.001  2.52e-02 0.001
## 3  B49-S   3   3 1.099 3.00  0.667 1.000   25.641 0.001  4.39e-03 0.001
## 4  B53-S   3   3 1.099 3.00  0.667 1.000    2.368 0.031  4.05e-04 0.031
## 5  B60-S   3   3 1.099 3.00  0.667 1.000    2.806 0.030  5.59e-04 0.030
## 6  C59-S   3   1 0.000 1.00  0.000   NaN   75.414 0.001  3.18e-02 0.001
## 7  L05-S   3   3 1.099 3.00  0.667 1.000  522.314 0.001  7.98e-02 0.001
## 8  L08-S   3   3 1.099 3.00  0.667 1.000    3.491 0.015  4.91e-04 0.014
## 9  L10-S   3   3 1.099 3.00  0.667 1.000    3.801 0.017  5.11e-04 0.016
## 10 L11-S   3   3 1.099 3.00  0.667 1.000   -0.434 0.880 -5.62e-05 0.883
## 11 L12-S   3   3 1.099 3.00  0.667 1.000    0.195 0.330  2.71e-05 0.326
## 12 L13-S   3   3 1.099 3.00  0.667 1.000    5.724 0.001  8.04e-04 0.001
## 13 L45-S   1   1 0.000 1.00  0.000   NaN       NA    NA        NA    NA
## 14 L62-S   4   4 1.386 4.00  0.750 1.000   -0.264 0.723 -3.06e-05 0.728
## 15 C23-A   5   1 0.000 1.00  0.000   NaN   51.792 0.001  2.08e-02 0.001
## 16 C27-A   5   1 0.000 1.00  0.000   NaN   13.832 0.001  6.61e-03 0.001
## 17 C43-A   5   1 0.000 1.00  0.000   NaN  433.823 0.001  9.14e-02 0.001
## 18 C85-A   5   1 0.000 1.00  0.000   NaN   66.818 0.001  2.50e-02 0.001
## 19 C86-A   5   1 0.000 1.00  0.000   NaN   29.491 0.001  1.35e-02 0.001
## 20 C87-A   5   1 0.000 1.00  0.000   NaN  976.004 0.001  2.33e-01 0.001
## 21 C88-A   5   1 0.000 1.00  0.000   NaN    5.107 0.001  2.25e-03 0.001
## 22 L06-A   5   1 0.000 1.00  0.000   NaN   42.656 0.001  1.89e-02 0.001
## 23 L16-A   5   2 0.673 1.92  0.480 0.961 2870.226 0.001  3.10e-01 0.001
## 24 L17-A   5   1 0.000 1.00  0.000   NaN   11.574 0.001  5.73e-03 0.001
## 25 L39-A   5   2 0.500 1.47  0.320 0.725 3161.707 0.001  4.94e-01 0.001
## 26 L41-A   4   1 0.000 1.00  0.000   NaN  129.139 0.001  4.98e-02 0.001
## 27 L45-A   4   1 0.000 1.00  0.000   NaN   11.345 0.001  6.57e-03 0.001
## 28 L62-A   1   1 0.000 1.00  0.000   NaN       NA    NA        NA    NA
## 29 S03-A   4   1 0.000 1.00  0.000   NaN    7.464 0.001  3.80e-03 0.001
## 30  SM-A   5   1 0.000 1.00  0.000   NaN  121.604 0.001  2.86e-02 0.001
## 31 Total 114  49 3.096 9.30  0.892 0.393 5759.122 0.001  3.69e-01 0.001

NJ Tree on populations

Here’s a NJ tree on the populations. It uses provesti distance, which is probably biased, but it’s a start…

Map